Abstract

This paper presents a new method for learning and tuning a fuzzy 
logic controller based on reinforcements from a dynamic system. In 
particular, our Generalized Approximate Reasoning-based Intelligent 
Control (GARIC) architecture (a) learns and tunes a fuzzy logic controller
even when only weak reinforcement, such as a binary failure signal, is
available; (b) introduces a new conjunction operator for computing the
rule strengths of fuzzy control rules; (c) introduces a new
localized mean of maximum (LMOM) method in combining the con- 
clusions of several firing control rules; and (d) learns to produce real- 
valued control actions. Learning is achieved by integrating fuzzy infer- 
ence into a feedforward network, which can then adaptively improve 
performance by using gradient descent methods. We extend the AHC 
algorithm of Barto, Sutton, and Anderson to include the prior control 
knowledge of human operators. The GARIC architecture is applied 
to a cart-pole balancing system and has demonstrated significant im- 
provements in terms of the speed of learning and robustness to changes 
in the dynamic system’s parameters over previous schemes for cart-pole 
balancing. 


1 Introduction 

The non-linear behavior of many practical systems and the unavailability of
quantitative data regarding their input-output relations make the analytical
modeling of these systems very difficult. On the other hand, approximate
reasoning-based controllers, which do not require analytical models,
have demonstrated a number of successful applications such as the subway 
system in the city of Sendai [31], nuclear reactor control [12] and automobile 
transmission control [14]. These applications have mainly concentrated on 
emulating the performance of a skilled human operator in the form of lin- 
guistic rules. However, the process of learning and tuning the control rules 
to achieve the desired performance remains a difficult task. 

Starting with the Self Organizing Control (SOC) techniques of Mam- 
dani and his students (e.g., [23]), the need for research in developing fuzzy 
logic controllers which can learn from experience has been realized (e.g., 
[17]). The learning task may include the identification of the main control 
parameters (better known as system identification in control theory) or de- 
velopment and tuning of the fuzzy memberships used in the control rules. 
In this paper, we concentrate on the latter learning task and develop an 
architecture which can learn to adjust the fuzzy membership functions of 
the linguistic labels used in different control rules.

Connectionist learning approaches [5] can be used in learning control. 
Here, we can distinguish three classes: supervised learning, reinforcement
learning, and unsupervised learning. In supervised learning, a teacher pro- 
vides the desired control objective at each time step to the learning system. 
In reinforcement learning, the teacher’s response is not as direct, immediate, 
and informative as in supervised learning and it serves more to evaluate the 
state of the system. The presence of a teacher or a supervisor to provide 
the correct control response is not assumed in unsupervised learning. 

If supervised learning can be used in control (e.g., when the input-output 
training data is available), it has been shown that it is more efficient than
reinforcement learning (e.g., [6, 1]). However, many control problems re- 
quire selecting control actions whose consequences emerge over uncertain 
periods for which input-output training data are not readily available. In 
such domains, reinforcement learning techniques are more appropriate than 
supervised learning. 

The organization of this paper is as follows. We first review some
fundamentals of fuzzy logic control, reinforcement learning, and credit
assignment. Next, we discuss the Generalized Approximate Reasoning-based
Intelligent Control (GARIC) architecture. This architecture addresses two
related problems. First, we introduce techniques for the design of rule-based
controllers which use qualitative linguistic rules obtained from human expert
controllers. Second, we describe a controller that learns directly from
experience and automatically develops and adjusts the definitions of its
linguistic labels. Finally, we describe the application of this architecture
to the real-world control problem of cart-pole balancing.


2 Fuzzy Sets and Fuzzy Logic Control 


A fuzzy set, defined originally by Zadeh [32], is an extension of a crisp set.
Crisp sets only allow full membership or no membership at all, whereas fuzzy 
sets allow partial membership. In other words, an element may partially 
belong to a set. In a crisp set, the membership or non-membership of an 
element $x$ in set $A$ is described by a characteristic function $\mu_A(x)$, where:

\[
\mu_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}
\]


Fuzzy set theory extends this concept by defining partial memberships 
which can take values ranging from 0 to 1: 

\[ \mu_A : X \to [0, 1] \]

where X refers to the universal set defined in a specific problem. 

Assuming that $A$ and $B$ are two fuzzy sets with membership functions
$\mu_A$ and $\mu_B$, the following operations can be defined on these sets. The
complement of a fuzzy set $A$ is a fuzzy set $\bar{A}$ with the membership function

\[ \mu_{\bar{A}}(x) = 1 - \mu_A(x). \]

The union of $A$ and $B$ is a fuzzy set with the following membership function:

\[ \mu_{A \cup B} = \max\{\mu_A, \mu_B\}. \]

The intersection of $A$ and $B$ is a fuzzy set with the membership function

\[ \mu_{A \cap B} = \min\{\mu_A, \mu_B\}. \]

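As a minimal sketch of the three operations above, the following Python fragment builds the complement, union and intersection pointwise from two membership functions. The labels `mu_small` and `mu_large` and their breakpoints are hypothetical and chosen only for illustration; they are not taken from the controller described in this paper.

```python
def fuzzy_complement(mu_a):
    """Membership function of the complement: 1 - mu_A(x)."""
    return lambda x: 1.0 - mu_a(x)


def fuzzy_union(mu_a, mu_b):
    """Membership function of the union: max{mu_A(x), mu_B(x)}."""
    return lambda x: max(mu_a(x), mu_b(x))


def fuzzy_intersection(mu_a, mu_b):
    """Membership function of the intersection: min{mu_A(x), mu_B(x)}."""
    return lambda x: min(mu_a(x), mu_b(x))


# Hypothetical labels on the universe X = [0, 10] (illustrative assumptions).
def mu_small(x):
    return min(1.0, max(0.0, 1.0 - x / 5.0))


def mu_large(x):
    return min(1.0, max(0.0, (x - 2.0) / 5.0))
```

With these assumed labels, an element such as x = 3 belongs partially to both sets, so the union and intersection return two different, non-trivial degrees.
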
Different methods for developing fuzzy logic controllers have been suggested
in recent years and are reviewed in [8]. In the design of a fuzzy
controller, one must identify the main control parameters and determine a 
term set which is at the right level of granularity for describing the values of 
each linguistic variable*. For example, a term set including linguistic values 
such as { Small, Medium, Large } may not be satisfactory in some domains, 
and may instead require the use of a five term set such as {Very Small, 
Small, Medium, Large, and Very Large}. 

Figure 1 illustrates a simple architecture for a fuzzy logic controller. 
The system dynamics of the plant is measured by a set of sensors. This 
architecture consists of four elements whose functions are described next. 

In coding the values from the sensors, one transforms the values of the 
sensor measurements by using the linguistic labels in the rule preconditions. 
This process is commonly called fuzzification or encoding. The fuzzification 
stage requires matching the sensor measurements against the membership 
functions of linguistic labels. 

In modeling the human expert operator’s knowledge, fuzzy control rules 
of the form: 

IF Error is small AND Change-in-error is small THEN Force is small

can be used effectively when expert human operators can express the heuris- 
tics or the control knowledge that they use in controlling a process in terms 
of rules of the above form. 


2.1 Conflict Resolution and Decision Making 

As mentioned earlier, due to the partial matching attribute of fuzzy control 
rules and the fact that the preconditions of rules do overlap, more than one 
fuzzy control rule can fire at a time. The methodology which is used in 
deciding what control action should be taken as the result of the firing of 
several rules can be referred to as conflict resolution. The following example, 
using two rules, illustrates this process. Assume that we have the following 
rules: 

Rule 1: IF X is $A_1$ and Y is $B_1$ THEN Z is $C_1$
Rule 2: IF X is $A_2$ and Y is $B_2$ THEN Z is $C_2$

*A linguistic variable is a variable which can only take linguistic values.


Each rule has an antecedent or if part containing several preconditions, and 
a consequent or then part which prescribes the value of one or more output 
actions. Now, if we have $x_0$ and $y_0$ as the sensor readings for fuzzy variables
$X$ and $Y$, then their truth values are represented by $\mu_{A_1}(x_0)$ and $\mu_{B_1}(y_0)$,
respectively, for Rule 1, where $\mu_{A_1}$ and $\mu_{B_1}$ represent the membership
functions for $A_1$ and $B_1$, respectively. Similarly for Rule 2, we have $\mu_{A_2}(x_0)$ and
$\mu_{B_2}(y_0)$ as the truth values of the preconditions. The strength of Rule 1 is

\[ w_1 = \wedge(\mu_{A_1}(x_0), \mu_{B_1}(y_0)). \]

Similarly for Rule 2:

\[ w_2 = \wedge(\mu_{A_2}(x_0), \mu_{B_2}(y_0)), \]

where $\wedge$ denotes a conjunction or intersection operator. Traditionally, fuzzy
logic controllers use a minimum operator for $\wedge$. However, here we use a
softmin operator which produces the same result in the limit but in general 
is not as specific as the minimum operator is. The reason for this is differ- 
entiability, which we need for learning purposes. This will be dealt with in 
greater detail later. 

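Using the softmin, the strength of Rule 1 can be calculated by:

\[ w_1 = \frac{\mu_{A_1}(x_0) e^{-k \mu_{A_1}(x_0)} + \mu_{B_1}(y_0) e^{-k \mu_{B_1}(y_0)}}{e^{-k \mu_{A_1}(x_0)} + e^{-k \mu_{B_1}(y_0)}}. \]

Similarly for Rule 2:

\[ w_2 = \frac{\mu_{A_2}(x_0) e^{-k \mu_{A_2}(x_0)} + \mu_{B_2}(y_0) e^{-k \mu_{B_2}(y_0)}}{e^{-k \mu_{A_2}(x_0)} + e^{-k \mu_{B_2}(y_0)}}. \]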

The control output of Rule 1 is calculated by applying the matching strength
of its preconditions to its conclusion. For Rule 1, we assume that

\[ z_1 = \mu_{C_1}^{-1}(w_1) \]

and for Rule 2:

\[ z_2 = \mu_{C_2}^{-1}(w_2). \]

In this paper, we introduce a new defuzzification procedure to compute the
expression $\mu^{-1}(w)$, which is explained later. The above equations show that
as a result of reading sensor values $x_0$ and $y_0$, Rule 1 is recommending a
control action $z_1$ and Rule 2 is recommending a control action $z_2$. The
combination of the above rules produces a nonfuzzy control action $z^*$ which
is calculated using a weighted averaging approach:

\[ z^* = \frac{\sum_{i=1}^{n} w_i z_i}{\sum_{i=1}^{n} w_i} \]

where $n$ is the number of rules, and $z_i$ is the amount of control action
recommended by rule $i$. A similar procedure can be used for multiple output
variables in the consequents.
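The two-rule example above can be traced numerically with a short Python sketch. The membership degrees, the per-rule recommendations $z_1$ and $z_2$, the value of $k$, and the function names `softmin` and `combine_rules` are all assumptions made for illustration; the defuzzification step that would produce $z_1$ and $z_2$ is taken as given here.

```python
import math


def softmin(degrees, k=10.0):
    """Softmin of membership degrees: sum(mu * exp(-k*mu)) / sum(exp(-k*mu))."""
    weights = [math.exp(-k * mu) for mu in degrees]
    return sum(mu * w for mu, w in zip(degrees, weights)) / sum(weights)


def combine_rules(strengths, actions):
    """Weighted average z* = sum(w_i * z_i) / sum(w_i) over all rules."""
    total = sum(strengths)
    if total == 0.0:
        raise ValueError("no rule fired: z* is undefined")
    return sum(w * z for w, z in zip(strengths, actions)) / total


# Assumed membership degrees for the sensor readings x_0 and y_0.
w1 = softmin([0.8, 0.6])   # Rule 1: mu_A1(x_0), mu_B1(y_0)
w2 = softmin([0.3, 0.5])   # Rule 2: mu_A2(x_0), mu_B2(y_0)

# Assumed per-rule recommendations z_1 and z_2 (outputs of defuzzification).
z_star = combine_rules([w1, w2], [2.0, -1.0])
```

Each rule strength lies between the minimum and the arithmetic mean of its precondition degrees, and the combined action lies between the two recommended actions, pulled toward the rule that fired more strongly.
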

3 Reinforcement Learning 

In reinforcement learning, one assumes that there is no supervisor to criti- 
cally judge the chosen control action at each time step. The learning system 
is told indirectly about the effect of its chosen control action. The study of 
reinforcement learning relates to credit assignment where, given the perfor- 
mance (results) of a process, one has to distribute reward or blame to the 
individual elements contributing to that performance. This may be further 
complicated if there is a sequence of actions, which is collectively awarded 
a delayed reinforcement. In rule-based systems, for example, this means 
assigning credit or blame to individual rules (or their parts) engaged in the 
problem solving process. Samuel’s checkers-playing program is probably the 
earliest AI program which used this idea [25]. Michie and Chambers [19] used 
a reward-punishment strategy in their BOXES system, which learned to do 
cart-pole balancing by discretizing the state space into non-overlapping
regions (boxes) and applying two opposite constant forces. Barto, Sutton, and
Anderson [4] used two neuron-like elements to solve the learning problem in 
cart-pole balancing. In these approaches, the state-space is partitioned into 
non-overlapping smaller regions and then the credit assignment is performed 
on a local basis. 

Reinforcement learning has its roots in studies of animal learning and 
research on human behavior (e.g., [3]). It directly relates to the theory of 
learning automata initiated by the work of Tsetlin [28] and further developed
by the work of Narendra and Thathachar [22], Narendra and Lakshmivarahan
[21], and Mendel and McLaren [18] in control engineering. Since
reinforcement learning techniques do not use an explicit teacher or supervi- 
sor, they construct an internal evaluator or a critic capable of evaluating the 
dynamic system’s performance. The construction of this critic so that it can 
properly evaluate the performance in a way which is useful to the control 
objective, is itself a significant problem in reinforcement learning. Given
the evaluation by the critic, the other problem in reinforcement learning is 
how to adjust the control signal. Barto [5] discusses several approaches to 
this problem based on the gradient of the critic’s evaluation as a function of 
control signals. 


Temporal Difference methods Related to reinforcement learning are 
the Temporal Difference (TD) methods, a class of incremental learning pro- 
cedures specialized for prediction problems, which have been introduced by 
Sutton [27]. The main characteristic of these methods is that they learn from
successive predictions whereas in the case of supervised learning methods, 
learning occurs when the difference between the predicted outcome and the 
actual outcome is revealed (i.e., the learning model in TD does not have to 
wait until the actual outcome is known and can update its parameters within 
a trial period). The difference between the Temporal Difference methods
and the supervised learning methods becomes clear when these methods are 
distinguished as single-step versus multi-step prediction problems. In the 
single-step prediction (e.g., Widrow-Hoff rule [29]), complete information 
regarding the correctness of a prediction is revealed at once. However, in 
multi-step prediction, this information is not revealed until more than one 
step after the prediction is made, but partial information becomes available 
at each step. Barto et al. have recently shown a stronger relation between a
specific class of these methods, called the TD algorithm, and dynamic
programming [7].

ARIC Architecture The Approximate Reasoning-based Intelligent Control
(ARIC) architecture has been proposed in [10]. This architecture extends
Anderson’s method [1] by including the prior control knowledge of expert 
operators in terms of fuzzy control rules. In ARIC, a neural network is used 
to perform action and state evaluations. Also, two coupled neural networks 
are used to select a control action at each time step where the first network
uses fuzzy inference to recommend an action and the second network calculates
a degree to which the action recommended by the first network should
be modified. The ARIC architecture tunes its fuzzy controller through up- 
dating the weights on the links in these networks. As this learning proceeds, 
the action recommended by the fuzzy controller is followed more often. Only 
monotonic membership functions are used in ARIC and the fuzzy labels used 
in the control rules are adjusted locally within each rule. However, in the 
architecture presented next, we provide an algorithm to tune the fuzzy labels
globally in all the rules and allow any type of differentiable membership
function to be used in the construction of a fuzzy logic controller. 


4 The GARIC architecture 

Our system will determine a control action by using a neural network which
implements fuzzy inference. In this way, prior expert knowledge can be 
easily incorporated. This knowledge is allowed to be faulty or damaged. 
Another neural net will learn to become a good evaluator of the current 
state and will serve as an internal critic. Both networks will adapt their 
weights concurrently so as to improve performance. 

The architecture of GARIC is schematically shown in Figure 3. 

It has three components: 

• The Action Selection Network (ASN) maps a state vector into a recommended
action $F$, using fuzzy inference.

• The Action Evaluation Network (AEN) maps a state vector and a
failure signal into a scalar score which indicates state goodness. This
is also used to produce internal reinforcement $\hat{r}$.

• The Stochastic Action Modifier (SAM) uses both $F$ and $\hat{r}$ to produce
an action $F'$ which is applied to the plant.


The ensuing state is fed back into the controller, along with a boolean failure 
signal. Learning occurs by fine-tuning of the free parameters in the two 
networks: in the AEN, the weights are adjusted; in the ASN, the parameters
describing the fuzzy membership functions change.

4.1 The Action Evaluation Network


The AEN plays the role of an adaptive critic element (ACE) [4] and con- 
stantly predicts reinforcements associated with different input states. The 
only information received by the AEN is the state of the physical system in 
terms of its state variables and whether or not a failure has occurred. 

The AEN is a standard two-layer feedforward net with sigmoids every- 
where except in the output layer. The input is the state of the plant, and 
the output is an evaluation of the state (a score), denoted by $v$. This $v$-value
is suitably discounted and combined with the external failure signal
to produce the internal reinforcement $\hat{r}$, as given in (4).

The structure of an evaluation network includes $h$ hidden units and $n$
input units from the environment, and a bias unit (i.e., $x_0, x_1, \ldots, x_n$). In
this network, each hidden unit receives n + 1 inputs and has n + 1 weights, 
while each output unit receives n + h + 1 inputs and has n + h + 1 weights. 
This structure is shown in Figure 4. The learning algorithm is composed of 
Sutton’s AHC algorithm [26] for the output unit and error back-propagation 
algorithm [24] for the hidden units. 

The AEN produces a prediction of future reinforcement for a given state, 
and the changes in this prediction are used to guide the SAM in selecting 
actions. For example, if we move from a state with prediction of low rein- 
forcement to a state with prediction of higher reinforcement, this positive 
change, also called heuristic or internal reinforcement, is used to reinforce 
the selection of the action which caused this move. 

The output of the units in the hidden layer is:

\[ y_i[t, t+1] = g\left( \sum_{j=1}^{n} a_{ij}[t] \, x_j[t+1] \right) \tag{1} \]

where

\[ g(s) = \frac{1}{1 + e^{-s}} \tag{2} \]

and $t$ and $t+1$ are successive time steps. The output unit of the evaluation
network receives inputs both from the units in the hidden layer (i.e., $y_i$) and
directly from the units in the input layer (i.e., $x_i$):

\[ v[t, t+1] = \sum_{i=1}^{n} b_i[t] \, x_i[t+1] + \sum_{i=1}^{h} c_i[t] \, y_i[t, t+1] \tag{3} \]

where $v$ is the prediction of reinforcement. In the above equations (and the
equations which follow), double time dependencies are used to avoid instabilities
in the updating of weights [2]. For example, in the above equation,
the weights at time $t$ are multiplied by the $x_i$'s at time $t+1$. If the same time
index were used, we could not detect whether a change in $v$ was caused by
a change in the weights (i.e., $b_i$ and $c_i$) or by a change in
the state of the system (i.e., $x_i$). Writing the equation as shown above with
different time steps allows us to compare different $v$'s over time and notice
whether the system has moved to a better state (i.e., higher reinforcement)
or to a worse state (i.e., lower reinforcement).
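The forward pass of Equations (1) to (3), together with the double time dependence, can be sketched in Python as follows. The representation of weights as nested lists, the name `aen_forward`, the treatment of the bias as just another component of the input vector, and all numeric values are assumptions made for illustration.

```python
import math


def sigmoid(s):
    """Equation (2): g(s) = 1 / (1 + exp(-s))."""
    return 1.0 / (1.0 + math.exp(-s))


def aen_forward(a, b, c, x):
    """Evaluate the AEN with weights (a, b, c) on the input vector x.

    a: h rows of n input-to-hidden weights a_ij
    b: n direct input-to-output weights b_i
    c: h hidden-to-output weights c_i
    Returns (v, y): the state evaluation of Equation (3) and the hidden
    outputs of Equation (1).
    """
    y = [sigmoid(sum(a_ij * x_j for a_ij, x_j in zip(row, x))) for row in a]
    v = (sum(b_i * x_i for b_i, x_i in zip(b, x))
         + sum(c_i * y_i for c_i, y_i in zip(c, y)))
    return v, y


# Hypothetical weights at time t and two successive states (illustrative only).
a_t = [[0.0, 0.0], [0.0, 0.0]]
b_t = [1.0, -1.0]
c_t = [2.0, 4.0]
x_t, x_t1 = [1.0, 1.0], [3.0, 1.0]

v_tt, _ = aen_forward(a_t, b_t, c_t, x_t)     # v[t, t]
v_tt1, _ = aen_forward(a_t, b_t, c_t, x_t1)   # v[t, t+1]: same weights, next state
```

Because both evaluations use the weights from time $t$, the difference between `v_tt1` and `v_tt` in this sketch reflects only the change in state.
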

This network evaluates the action recommended by the action network
as a function of the failure signal and the change in state evaluation based
on the state of the system at time $t+1$:

\[ \hat{r}[t+1] = \begin{cases} 0 & \text{start state;} \\ r[t+1] - v[t,t] & \text{failure state;} \\ r[t+1] + \gamma v[t,t+1] - v[t,t] & \text{otherwise} \end{cases} \tag{4} \]

where $0 < \gamma < 1$ is the discount rate. In other words, the change in the value
of $v$ plus the value of the external reinforcement constitutes the heuristic or
internal reinforcement $\hat{r}$, where the future values of $v$ are discounted more
the further they are from the current state of the system. For example,
the value of $v$ generated one time step later is given less weight than
the current value of $v$. This method of estimating reinforcement gives an
approximate exponential trace of $v$, where the series is truncated after two
terms.
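Equation (4) translates directly into a three-way branch. The sketch below is one possible rendering; the function name, the Boolean flags `start` and `failure`, and the default discount rate of 0.9 are assumptions, not values given in the text.

```python
def internal_reinforcement(r_next, v_tt, v_tt1, gamma=0.9,
                           start=False, failure=False):
    """Internal reinforcement r_hat[t+1] of Equation (4).

    r_next: external reinforcement r[t+1]
    v_tt:   evaluation v[t, t] of the previous state
    v_tt1:  evaluation v[t, t+1] of the new state (unused on failure)
    """
    if start:
        return 0.0
    if failure:
        return r_next - v_tt
    return r_next + gamma * v_tt1 - v_tt
```

For instance, with no external reinforcement, moving from a state evaluated at 0.25 to one evaluated at 1.0 with a discount rate of 0.5 gives a positive internal reinforcement of 0.25, which rewards the action that caused the move.
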

4.2 Action Selection Network 

Given the current state of the plant, this network selects an action by im- 
plementing an inference scheme based on fuzzy control rules as explained in 
Section 2. It can be represented as a network with 5 layers of nodes, each
layer performing one stage of the fuzzy inference process (see Figure 5). The 
connections are feedforward, with each node performing a local computation.
However, this computation may be different from the conventional
weighted-sum-of-inputs.

Layer 1 is the input layer, consisting of the real-valued input variables.
These can also be thought of as the linguistic variables of interest. No com- 
putation is done at these nodes. 


A Layer 2 node corresponds to one possible value of one of the linguistic
variables in Layer 1, e.g. if large is one of the values that $x$ can take, a node
computing $\mu_{large}(x)$ belongs to Layer 2. It will have exactly one input, and
will feed its output to all the rules using the clause: if x is large, in their if
part. The function is given by

\[ \mu_{c_V, s_{VL}, s_{VR}}(\cdot) \]


where $V$ indicates a linguistic value (e.g. large), and $c_V$, $s_{VL}$, $s_{VR}$ correspond to
the center, left spread and right spread of the fuzzy membership function of
label $V$. $c_V$ serves as a reference point (the mode), and the spreads characterize
length scales on either side of the center, thus permitting asymmetry.
More parameters may be included if desired. An instance of a smooth membership
function is

\[ \mu_{c, s_L, s_R}(x) = \frac{1}{1 + \left| \frac{x - c}{s} \right|^{b}} \]

where $s = s_{VL}$ or $s_{VR}$ according as $x < c$ or $x > c$, and $b$ controls the
curvature. For triangular shapes, this function is given by

\[ \mu_{c, s_L, s_R}(x) = \begin{cases} 1 - |x - c|/s_R, & x \in [c, c + s_R] \\ 1 - |x - c|/s_L, & x \in [c - s_L, c) \\ 0 & \text{otherwise} \end{cases} \tag{5} \]


Triangular shapes are to be preferred because they are simple and have 
been proven to be sufficient in scores of application domains. The center 
and spreads may be considered as weights on the input links, analogous to 
the approach taken with radial-basis-function units in neural networks [20].
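A Layer 2 node with a triangular label, Equation (5), can be sketched in a few lines of Python. The function name and the example parameters are illustrative assumptions; the spreads are assumed to be strictly positive.

```python
def triangular(x, c, s_left, s_right):
    """Triangular membership of Equation (5) with center c and positive spreads."""
    if c <= x <= c + s_right:
        return 1.0 - abs(x - c) / s_right
    if c - s_left <= x < c:
        return 1.0 - abs(x - c) / s_left
    return 0.0
```

With a center at 0, a left spread of 1 and a right spread of 2, the inputs -0.5 and 1.0 both receive a membership degree of 0.5, which shows the asymmetry that separate spreads permit.
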


Layer 3 implements the conjunction of all the antecedent conditions in a 
rule. A node in layer 3 corresponds to a rule in the rule-base. Its inputs 
come from all nodes in Layer 2 which participate in the if part of that rule.
The node itself performs the min operation, which we have softened to the 
following continuous, differentiable softmin operation: 


\[ O_{r,3} = w_r = \frac{\sum_i \mu_i e^{-k \mu_i}}{\sum_i e^{-k \mu_i}} \tag{6} \]




Here, $\mu_i$ is the degree of match between a fuzzy label occurring as one of the
antecedents of rule $r$, and the corresponding input variable. This softmin
operation gives $w_r$, the degree of applicability of Rule $r$. The parameter $k$
controls the hardness of the softmin operation, and as $k \to \infty$, we recover
the usual min operator. However, for finite $k$, we get a differentiable function
of the inputs, which makes it convenient for calculating gradients during the
learning process. The choice of $k$ is not critical.
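The effect of the hardness parameter can be checked numerically. In the sketch below, the exponents are shifted by the smallest degree before exponentiation; this shift is an implementation assumption that leaves the value of Equation (6) unchanged (numerator and denominator are scaled by the same factor) while avoiding underflow of every term for large $k$. The degrees and the values of $k$ are illustrative.

```python
import math


def softmin(degrees, k):
    """Equation (6), evaluated with exponents shifted by min(degrees)."""
    m = min(degrees)
    weights = [math.exp(-k * (mu - m)) for mu in degrees]
    return sum(mu * w for mu, w in zip(degrees, weights)) / sum(weights)


degrees = [0.2, 0.8]   # assumed antecedent match degrees
values = {k: softmin(degrees, k) for k in (0.0, 1.0, 10.0, 1000.0)}
```

At $k = 0$ the operator reduces to the arithmetic mean of the degrees, and as $k$ grows the result decreases toward the minimum degree, in line with the limit stated above.
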

A Layer 4 node corresponds to a consequent label. Its inputs come
from all rules which use this particular consequent label. For each of the
$w_r$ supplied to it, this node computes the corresponding output action as
suggested by rule $r$. This mapping may be written as

\[ \mu^{-1}_{c_V, s_{VL}, s_{VR}}(w_r) \]

where $V$ indicates a specific consequent label, $c_V$, $s_{VL}$, $s_{VR}$ parameterize the
membership function as before, and the inverse is taken to mean a suitable
defuzzification procedure applicable to an individual rule. In general, the
mathematical inverse of $\mu$ may not exist if the function is not strictly monotonic.
We propose a simple procedure to determine this inverse: if $w_r$ is the
degree to which Rule $r$ is satisfied,

$\mu_V^{-1}(w_r)$ is the X-coordinate of the centroid of the set $\{x : \mu_V(x) = w_r\}$.

This is similar to the Mean-of-Maximum method of defuzzification [8],
but the latter is applied after all rule consequents have been combined,
whereas we apply it locally, to each rule, before the consequents are combined.
We will refer to this variation as the LMOM (Local Mean-of-Maximum)
method* (see Figure 6).

For triangular functions, LMOM gives 

\[ \mu^{-1}_{c_V, s_{VL}, s_{VR}}(w_r) = c_V + \frac{1}{2}(s_{VR} - s_{VL})(1 - w_r) \tag{7} \]

* Although LMOM was independently derived in our work, we were later referred to
Yager’s level set method [30] by a reviewer of this paper. The LMOM and level
set methods are similar in nature, although Yager [30] does not discuss the case of skewed
and convex fuzzy sets in any detail.

For the case $w_r = 0$, the limiting value $\mu^{-1}(w_r \to 0^+)$ is used (which
is $c_V + (s_{VR} - s_{VL})/2$). It is easy to see that the set $\mu^{-1}([0,1])$ is the
projection of the median of the triangular membership function on the X-axis.
If the membership function is monotonic, then $\mu^{-1}(w_r)$ is just the
standard mathematical inverse, with appropriate limiting values.
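For a triangular label, the LMOM value of Equation (7) can be compared with the midpoint of the two points at which the label attains the level $w_r$. The sketch below does this in Python; both function names and the example parameters are assumptions made for illustration.

```python
def lmom_triangular(w, c, s_left, s_right):
    """Equation (7): c + (s_right - s_left) * (1 - w) / 2."""
    return c + 0.5 * (s_right - s_left) * (1.0 - w)


def level_set_midpoint(w, c, s_left, s_right):
    """Midpoint of the two points where a triangular label equals w (0 < w <= 1)."""
    left = c - s_left * (1.0 - w)
    right = c + s_right * (1.0 - w)
    return 0.5 * (left + right)
```

For a label with center 2, left spread 1 and right spread 3, a fully satisfied rule returns the center itself, while lower degrees shift the recommended value toward the wider side of the label, reaching 3 in the limit of zero strength.
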

The unusual feature of a unit in Layer 4 is that it may have multiple out- 
puts carrying different values, since sharing of consequent labels is allowed. 
For each rule feeding it a degree, it should produce a corresponding output 
action which is fed to the next layer. However, this nonstandard feature 
can be eliminated for many classes of membership functions. For triangular 
functions, such a node needs to output only the value 

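\[ O_{V,4} = \left( c_V + \frac{1}{2}(s_{VR} - s_{VL}) \right) \sum_r w_r - \frac{1}{2}(s_{VR} - s_{VL}) \sum_r w_r^2 \tag{8} \]

where the sums run over the rules $r$ that feed this node.

In general, whenever $\mu^{-1}(w)$ is polynomial in $w$, only one output is sufficient,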
regardless of the number of inputs. This transformation is possible because 
of the form of the computation done in the next layer. 


Layer 5 will have as many nodes as there are output action variables. Each
output node combines the recommendations from all the fuzzy control rules
in the rulebase, using the following weighted sum, the weights being the rule
strengths:

\[ F = \frac{\sum_r w_r \, \mu^{-1}(w_r)}{\sum_r w_r} \tag{9} \]

By taking advantage of the transformation used in Layer 4, this may be
rewritten as

\[ F = \frac{\sum_V O_{V,4}}{\sum_r O_{r,3}} \tag{10} \]


where the inputs come from Layer 3 and Layer 4. The node simply sums 
up each set of inputs and takes their quotient. This delivers a continuous 
output variable value which is the action selected by the ASN. F will always 
be defined if each dimension of the input space is completely covered by the 
antecedent label functions. 


Modifiable weights are present on input links into Layers 2 and 4 only.
The other weights are fixed at unity. This means that the gradient descent 
procedure effectively works on only two layers of weights, rather than all 
five.

4.3 Stochastic Action Modifier


This uses the value of $\hat{r}$ from the previous time step and the action $F$ recommended
by the ASN to stochastically generate an action $F'$ which is a
Gaussian random variable with mean $F$ and standard deviation $\sigma(\hat{r}(t-1))$.
This $\sigma()$ is some nonnegative, monotone decreasing function, e.g. $\exp(-\hat{r})$.
The action $F'$ is what is actually applied to the plant. The stochastic perturbation
in the suggested action leads to a better exploration of state space
and better generalization ability. The magnitude of the deviation $|F' - F|$ is
large when $\hat{r}$ is low, and small when the internal reinforcement is high. The
result is that a large random step away from the recommendation results
when the last action performed is bad, but the controller remains consistent
with the fuzzy control rules when the previous action selected is a good
one. The actual form of the function $\sigma()$, especially its scale and rate of
decrease, should take the units and range of variation of the output variable
into account.


The perturbation at each time step is denoted

\[ s(t) = \frac{F'(t) - F(t)}{\sigma(\hat{r}(t-1))} \tag{11} \]


and is simply the normalized deviation from the ASN-recommended action. 
This will contribute as a learning factor in the ASN. 
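A minimal Python sketch of the Stochastic Action Modifier follows. It uses the example spread $\exp(-\hat{r})$ mentioned in the text; the function names, the optional random number generator argument, and the absence of any scaling to the units of the output variable are assumptions made for illustration.

```python
import math
import random


def sigma(r_hat):
    """Example spread from the text: exp(-r_hat), nonnegative and decreasing."""
    return math.exp(-r_hat)


def stochastic_action(f_recommended, r_hat_prev, rng=random):
    """Draw F' ~ Normal(F, sigma(r_hat(t-1))) and return (F', s(t)), Equation (11)."""
    spread = sigma(r_hat_prev)
    f_applied = rng.gauss(f_recommended, spread)
    s = (f_applied - f_recommended) / spread
    return f_applied, s
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible. In a real controller the spread would also need to be scaled to the range of the force being applied, as noted above.
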


5 Learning Mechanisms 

5.1 Learning in AEN 

Weight-updating in this network is similar to a reward/punishment scheme
for neural networks. If positive (negative) internal reinforcement is
received, the values of the weights are rewarded (punished) by being changed
in the direction which increases (decreases) their contribution to the total sum.
The weights on the links connecting the units in the input layer directly to
the units in the output layer are updated according to the following:

\[ b_i[t+1] = b_i[t] + \beta \hat{r}[t+1] x_i[t] \tag{12} \]

where $\beta > 0$ is a constant and $\hat{r}[t+1]$ is the internal reinforcement at time
$t+1$.

Similarly, for the weights on the connections between the hidden layer
and the output layer, we have:

\[ c_i[t+1] = c_i[t] + \beta \hat{r}[t+1] y_i[t,t] \tag{13} \]

The weight update function for the hidden layer is based on a modified
version of the error back-propagation algorithm [24]. Since no direct error
measurement is possible (i.e., knowledge of correct action is not available),
as in Anderson [1], $\hat{r}$ plays the role of an error measure in the update of the
output unit’s weights: if $\hat{r}$ is positive, the weights are altered so as to increase
the output $v$ for positive input, and vice versa. Therefore, the equation for
updating the weights is

\[ a_{ij}[t+1] = a_{ij}[t] + \beta_h \hat{r}[t+1] \, y_i[t,t] (1 - y_i[t,t]) \, \mathrm{sgn}(c_i[t]) \, x_j[t] \tag{14} \]

where $\beta_h > 0$. Note that in the above equation, the sign of a hidden unit’s
output weight, rather than its value, is used. This variation is based on
Anderson’s empirical results that the algorithm is more robust if only the
sign of the weight is used rather than its value.
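The three update rules, Equations (12) to (14), can be sketched together in Python. The list-based weight layout, the function names, and the default learning rates are assumptions made for illustration; the sketch returns new weights rather than modifying the old ones, so the hidden-layer update uses the sign of $c_i[t]$ as the equations require.

```python
def sign(z):
    """Return -1, 0 or 1 according to the sign of z."""
    return (z > 0) - (z < 0)


def aen_update(a, b, c, x, y, r_hat, beta=0.3, beta_h=0.3):
    """One AEN learning step.

    x: inputs x[t]; y: hidden outputs y[t, t]; r_hat: internal reinforcement
    r_hat[t+1]. Returns the updated weights (a, b, c) for time t+1.
    """
    new_b = [b_i + beta * r_hat * x_i for b_i, x_i in zip(b, x)]        # (12)
    new_c = [c_i + beta * r_hat * y_i for c_i, y_i in zip(c, y)]        # (13)
    new_a = [
        [a_ij + beta_h * r_hat * y_i * (1.0 - y_i) * sign(c_i) * x_j    # (14)
         for a_ij, x_j in zip(row, x)]
        for row, y_i, c_i in zip(a, y, c)
    ]
    return new_a, new_b, new_c
```

A positive `r_hat` moves every weight in the direction that raises the evaluation for the current input, and a negative one does the opposite.
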


5.2 Learning in ASN 


The ASN is a map from input to output space, denoted $F_p(x)$. Here, $p$
is the vector of all the weights in the network, which includes the centers
and spreads of all antecedent and consequent labels used in the fuzzy rules.
The intent of computing $F$ is to maximize $v$, so that the system ends up
in a good state and avoids failure. Hence, $v$ is the objective function which
needs to be maximized as a function of $p$, given the state. This can be
done by gradient descent, which estimates the derivative $\partial v / \partial p$, and uses
the learning rule

\[ \Delta p = \eta \frac{\partial v}{\partial p} = \eta \frac{\partial v}{\partial F} \frac{\partial F}{\partial p} \tag{15} \]

to adjust the parameter values. To do this, we need the two derivatives on 
the right hand side, which in general, will depend on the state. 


Even though $F$ is directly dependent on $p$, the dependence of $v$ on $F$
is quite indirect. Each application of the force $F$ is state-specific, and the
new state depends in a complicated way on the dynamics of the plant. In
addition, the transfer function of the AEN has to be taken into account to
see how the change in state affects $v$. Since part of this is unknown, and
part of it is computationally complex, we have made the approximation that
$\partial v / \partial F$ can be computed by the instantaneous difference ratio

\[ \frac{\partial v}{\partial F} \approx \frac{dv}{dF} \approx \frac{v(t) - v(t-1)}{F(t) - F(t-1)} \tag{16} \]

Since this ignores the change in state between successive time steps, it is
a very crude estimator of the derivative. We will therefore only use its
sign, and not its magnitude. Of course, the existence of the derivative is an
implicit assumption as well.

The other term $\partial F / \partial p$ is much more tractable. Since $F$ is known and
differentiable, a few applications of the chain rule through the 5 layers of
the ASN give the following set of learning rules. In what follows, $\mathrm{Con}(R_j)$
and $\mathrm{Ant}(R_j)$ are the consequent label and antecedent labels used by Rule $j$.
A label $V$ is parameterized by $p_V$, which may be one of center, left spread
or right spread.

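For consequent labels $V$ with parameters $p_V$, with $z$ standing for $\mu^{-1}$,
the action $F$ is linear in $p_V$, but nonlinear in $w_r$. Substituting for $z$ using
(7), and differentiating:

\[ F = \frac{\sum_j w_j z_j}{\sum_j w_j} \tag{17} \]

\[ z_j = c_V + \frac{1}{2}(s_{VR} - s_{VL})(1 - w_j), \quad V = \mathrm{Con}(R_j) \tag{18} \]

\[ \frac{\partial F}{\partial p_V} = \frac{1}{\sum_i w_i} \sum_{j : V = \mathrm{Con}(R_j)} w_j \frac{\partial z_j}{\partial p_V} \tag{19} \]

\[ \frac{\partial z_j}{\partial c_V} = 1 \tag{20} \]

\[ \frac{\partial z_j}{\partial s_{VR}} = \frac{1}{2}(1 - w_j) \tag{21} \]

\[ \frac{\partial z_j}{\partial s_{VL}} = -\frac{1}{2}(1 - w_j) \tag{22} \]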


These derivatives can be combined to compute the gradient with respect to the
consequent parameters. If only consequent
labels are to be tuned, this is all that needs to be calculated. In many
problems, this may be sufficient as well, since some error in the specification
of antecedent labels can be compensated for by modifying the consequent
labels.
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For antecedent labels, the calculations proceed similarly. The action 
depends on the degrees tu r , which in turn depend on the membership degrees 
fjLi generated in layer 2. 


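$$\frac{\partial F}{\partial w_r} = \frac{z(w_r) + w_r z'(w_r) - F}{\sum_i w_i} \tag{23}$$

$$\frac{\partial w_r}{\partial \mu_j} = \frac{e^{-k\mu_j} \left( 1 + k [w_r - \mu_j] \right)}{\sum_i e^{-k\mu_i}} \tag{24}$$

$$\frac{\partial F}{\partial \mu_V} = \sum_{V \in Ant(R_k)} \frac{\partial F}{\partial w_k} \frac{\partial w_k}{\partial \mu_V} \tag{25}$$

where $z'(w_r)$ is the derivative with respect to $w_r$. 

The membership degrees $\mu_V$ are the variables controlled by the parameters 
of the antecedent labels. 

$$\frac{\partial F}{\partial p_V} = \frac{\partial F}{\partial \mu_V} \frac{\partial \mu_V}{\partial p_V} \tag{26}$$

$$\frac{\partial v}{\partial F} \approx \mathrm{sign}\left( \frac{v(t) - v(t-1)}{F(t) - F(t-1)} \right) \tag{27}$$

The above derivatives can now be combined to get the gradient. 

$$\frac{\partial v}{\partial p_V} \approx \frac{\partial v}{\partial F} \frac{\partial F}{\partial p_V} \tag{28}$$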
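
Equation (24) can be verified against a numerical derivative. The sketch below assumes the softmin takes the form $w = \sum_i \mu_i e^{-k\mu_i} / \sum_i e^{-k\mu_i}$, which is the form whose derivative is (24); the function names and the sample membership degrees are hypothetical.

```python
import math


def softmin(mu, k=10.0):
    """Assumed softmin: exponentially weighted average of the degrees."""
    e = [math.exp(-k * m) for m in mu]
    return sum(m * ei for m, ei in zip(mu, e)) / sum(e)


def dsoftmin(mu, j, k=10.0):
    """Eq. (24): derivative of the rule strength w.r.t. degree mu[j]."""
    e = [math.exp(-k * m) for m in mu]
    w = sum(m * ei for m, ei in zip(mu, e)) / sum(e)
    return e[j] * (1.0 + k * (w - mu[j])) / sum(e)


# Illustrative membership degrees of one rule's antecedents.
mu = [0.3, 0.8, 0.5]
w = softmin(mu)
analytic = [dsoftmin(mu, j) for j in range(len(mu))]
```

Since the softmin is a weighted average, `w` always lies between the smallest and largest membership degree, and it is smooth in every argument, unlike a hard minimum.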


An appropriate multiplicative learning rate factor is used with this 
estimation of the gradient. This consists of the perturbation s(t) computed by 
the Stochastic Action Modifier, and the internal reinforcement $\hat{r}$ generated 
by the AEN, in addition to a constant $\eta$ which is set to a small positive 
value. The reason for using $s(t)\hat{r}(t)$ as a learning factor is that if a large 
perturbation results in a good action, then there should be an extra reward 
given to the weights, since probabilistic search has really helped the system 
in this case. Conversely, if the large random deviation is not beneficial, then 
it should have minimal effect on the weights. 


Since we are interested in maximizing v, 

$$\Delta p_V(t) = \eta \, s(t) \, \hat{r}(t) \, \frac{\partial v}{\partial p_V} \tag{29}$$

is the learning equation. The derivatives can be computed locally by each 
node after receiving relevant values backpropagated through the network. 
The only nodes whose weights will change are the ones in layers 2 and 4. All 
other edges have weights fixed at 1. 
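
A minimal sketch of one parameter update under (27)-(29) follows. The function and argument names are hypothetical; the caller is assumed to supply the sign estimate of $\partial v/\partial F$ and the analytic value of $\partial F/\partial p_V$.

```python
def update_parameter(p, eta, s, r_hat, sign_dv_dF, dF_dp):
    """One step of eq. (29), using (27)-(28) for the gradient estimate.

    p          current value of a label parameter
    eta        small positive learning-rate constant
    s          perturbation from the Stochastic Action Modifier
    r_hat      internal reinforcement from the AEN
    sign_dv_dF sign of the difference ratio, eq. (27)
    dF_dp      analytic derivative of the action w.r.t. p
    """
    return p + eta * s * r_hat * sign_dv_dF * dF_dp


# Illustrative numbers only: a center at 5.0 nudged by one update.
new_center = update_parameter(5.0, eta=0.01, s=0.5, r_hat=0.2,
                              sign_dv_dF=1.0, dF_dp=0.8)
```

With a zero perturbation or zero internal reinforcement the parameter is left unchanged, which matches the role of $s(t)\hat{r}(t)$ as a learning factor.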

A word about the existence of these derivatives. If the $\mu()$ used in layer 
2 are differentiable everywhere, then all the relevant derivatives will exist. 
However, for triangular membership functions, the derivative does not exist 
at three points, since the two one-sided limits are not equal at these points. 
The formally rigorous way to handle this is to consider the convex combination 
of all the gradients at the singular point, and to pick the one direction 
from this set that benefits the optimization algorithm most. A heuristic 
approximation to this scheme is to use an average of the two limits for 
the derivative at the singular points. We have chosen the simpler heuristic 
approach. Note that such a problem does not arise in layer 4 functions, since 
the LMOM method to compute $\mu^{-1}()$ gives a differentiable function, even 
if the corresponding $\mu$ is triangular in shape. 
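
The averaging heuristic can be sketched for the derivative of a triangular membership function with respect to its center. The parameterization below (membership falling linearly from 1 at the center to 0 at one spread away on each side) and all function names are assumptions of this sketch, and only finite spreads are handled.

```python
def tri(x, c, s_left, s_right):
    """Assumed triangular membership with center c and finite spreads."""
    if x <= c:
        return max(0.0, 1.0 - (c - x) / s_left)
    return max(0.0, 1.0 - (x - c) / s_right)


def _dtri_dc_open(x, c, s_left, s_right):
    """d(mu)/d(c) away from the three singular points."""
    if c - s_left < x < c:
        return -1.0 / s_left
    if c < x < c + s_right:
        return 1.0 / s_right
    return 0.0


def dtri_dc(x, c, s_left, s_right, h=1e-9):
    """d(mu)/d(c), averaging the two one-sided values at a singular point."""
    if x in (c - s_left, c, c + s_right):
        lo = _dtri_dc_open(x - h, c, s_left, s_right)
        hi = _dtri_dc_open(x + h, c, s_left, s_right)
        return 0.5 * (lo + hi)
    return _dtri_dc_open(x, c, s_left, s_right)
```

For a symmetric label the two one-sided values cancel at the center, so the heuristic gives a zero derivative there; for an asymmetric label it gives a small nonzero value.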

The other potential problem with derivatives in gradient descent methods 
is flat spots. When $\partial F/\partial p_V$ is 0 because the inputs lie outside the range of 
$\mu_V$, then no learning will occur for $p_V$. Strictly speaking, this is reasonable 
since V played no role in determining the action for this particular input. 
However, if the input data is confined to a portion of the input space such 
that V does not play any role at all, then the parameters controlling V will 
not be modified. In other words, the system will fail to generalize over parts 
of input space where there is little or no data available. This problem is 
partially avoided by using the Stochastic Action Modifier, which randomly 
perturbs the action performed so that the state trajectory of the system will 
not remain confined to a region of small volume. In our experiments, we 
have also used random starting configurations after a failure occurs. This 
removes sensitive dependence of the learning system on initial conditions. 

Slow learning may also occur because the process is caught in a narrow 
ravine with a gradually sloping bottom (as is known to happen with gradient 
descent methods in neural networks). This can be avoided by use of a 
momentum term [24], or some sort of line search technique to determine 
the optimal step size at each point [15]. We have not used any of these 
methods in our simulations because we did not encounter prohibitively slow 
learning. However, since a standard gradient descent is being used, any of 
these variations and additions to speed it up can always be used. 

6 The Cart-Pole Balancing Problem 


We now apply the GARIC approach to solve an interesting control problem. 
In this problem a pole is hinged to a motor-driven cart which moves on 
rail tracks to its right or its left. The pole has only one degree of freedom 
(rotation about the hinge point). The primary control tasks are to keep the 
pole vertically balanced and keep the cart within the rail track boundaries. 

Four state variables are used to describe the system status, and one 
variable represents the force applied to the cart. These are: 

• $x$: horizontal position of the cart; 

• $\dot{x}$: velocity of the cart; 

• $\theta$: angle of the pole with respect to the vertical line; 

• $\dot{\theta}$: angular velocity of the pole; 

• $f$: force applied to the cart. 

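The dynamics of the cart-pole system are modeled by the following nonlinear 
differential equations [4]: 

$$\ddot{\theta} = \frac{g \sin\theta + \cos\theta \left[ \dfrac{-f - m l \dot{\theta}^2 \sin\theta + \mu_c \, \mathrm{sgn}(\dot{x})}{m_c + m} \right] - \dfrac{\mu_p \dot{\theta}}{m l}}{l \left[ \dfrac{4}{3} - \dfrac{m \cos^2\theta}{m_c + m} \right]}$$

$$\ddot{x} = \frac{f + m l \left[ \dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta \right] - \mu_c \, \mathrm{sgn}(\dot{x})}{m_c + m}$$

where g is the acceleration due to gravity, $m_c$ is the mass of the cart, m is 
the mass of the pole, l is the half-pole length, $\mu_c$ is the coefficient of friction 
of cart on track, and $\mu_p$ is the coefficient of friction of pole on cart. These 
equations were simulated by the Euler method, which uses an approximation 
to the above equations, and a time step of 20 msec. 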
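
One Euler step of these equations might be sketched as follows. The masses and half-pole length are the default simulation values reported in the results; the gravity constant and both friction coefficients are not stated in this section and are assumptions of the sketch, as are all names.

```python
import math

G = 9.8            # assumed value, m/s^2
M_CART = 1.0       # kg
M_POLE = 0.1       # kg
HALF_LEN = 0.5     # m
MU_C = 0.0005      # assumed cart-track friction coefficient
MU_P = 0.000002    # assumed pole-cart friction coefficient
DT = 0.02          # s (20 msec time step)


def sgn(v):
    """Sign function returning -1, 0, or +1."""
    return (v > 0) - (v < 0)


def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step."""
    x, x_dot, theta, theta_dot = state
    total = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    inner = (-force - M_POLE * HALF_LEN * theta_dot ** 2 * sin_t
             + MU_C * sgn(x_dot)) / total
    theta_acc = ((G * sin_t + cos_t * inner
                  - MU_P * theta_dot / (M_POLE * HALF_LEN))
                 / (HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total)))
    x_acc = (force + M_POLE * HALF_LEN
             * (theta_dot ** 2 * sin_t - theta_acc * cos_t)
             - MU_C * sgn(x_dot)) / total
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```

Starting from rest with the pole upright, a positive force accelerates the cart to the right and tips the pole toward negative angles, as the two equations imply.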

We assume that a failure happens when $|\theta| > 12$ degrees or $|x| > 2.4$ 
meters. However, we later show that the system learns even when these 
two bounds are tightened. Also, we assume that the equations of motion 
of the cart-pole system are not known to the controller and only a vector 
describing the cart-pole system’s state at each time step is known. In other 
words, the cart-pole arrangement is treated as a black box by the learning 
system. 

Table 1: The membership functions: 14 labels for the antecedent and 9 
labels in the consequent 

Label  Center  Left spread  Right spread    Label  Center  Left spread  Right spread
PO1      0.3      0.3         -1            PL      20.0      5.0         -1.0
ZE1      0.0      0.3          0.3          PM      10.0      5.0          6.0
NE1     -0.3     -1            0.3          PS       5.0      4.0          5.0
VS1      0.0      0.05         0.05         PVS      1.0      1.0          1.0
PO2      1.0      1.0         -1.0          NL     -20.0     -1.0          5.0
ZE2      0.0      1.0          1.0          NM     -10.0      6.0          5.0
NE2     -1.0     -1.0          1.0          NS      -5.0      5.0          4.0
VS2      0.0      0.1          0.1          NVS     -1.0      1.0          1.0
PO3      0.5      0.5         -1.0          ZE       0.0      1.0          1.0
NE3     -0.5     -1.0          0.5
PO4      1.0      1.0         -1.0
NE4     -1.0     -1.0          1.0
PS4      0.0      0.01         1.0
NS4      0.0      1.0          0.01

Figure 7 presents the GARIC architecture as it is applied to this problem. 
The AEN network has 4 input units, a bias input unit, 5 hidden units and 
an output unit. The input state vector is normalized, so that the pole and 
cart positions lie in the range [0, 1]. The velocities are also normalized, but 
they are not constrained to lie in any range. The 35 weights of this net 
are initialized randomly to values in [-0.1, 0.1]. The learning rate for these 
weights is fixed at 0.3. The external reinforcement (i.e., the failure signal r) 
is received by the AEN and used to calculate the internal reinforcement $\hat{r}$. 
The discount factor $\gamma$ used in this calculation is 0.9. 
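
The count of 35 weights can be reproduced under one assumed connection pattern: every input (including the bias) feeds every hidden unit and the output unit directly, and every hidden unit feeds the output unit. This pattern and the names below are assumptions of the sketch; only the unit counts and the initialization range come from the text.

```python
import random

N_INPUTS = 4 + 1   # four state variables plus the bias unit
N_HIDDEN = 5


def init_aen(rng):
    """Draw all AEN weights uniformly from [-0.1, 0.1]."""
    def draw():
        return rng.uniform(-0.1, 0.1)
    a = [[draw() for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]  # input -> hidden
    b = [draw() for _ in range(N_INPUTS)]                             # input -> output
    c = [draw() for _ in range(N_HIDDEN)]                             # hidden -> output
    return a, b, c


def count_weights(a, b, c):
    """Total number of weights: 5*5 + 5 + 5 under the assumed pattern."""
    return sum(len(row) for row in a) + len(b) + len(c)


a, b, c = init_aen(random.Random(0))
n_weights = count_weights(a, b, c)
```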

6.1 The Action Selection Network 

The fuzzy control rules used to balance the pole successfully are shown 
in Table 6.1 and explained below. These completely determine the width of 
each layer in the ASN. There are 4 inputs, 14 units in layer 2 (the number 
of antecedent labels), 13 units in layer 3 (the number of rules), 9 units in 
layer 4 (the number of consequent labels) and finally, one output unit to 
compute the force. The initial definitions of all the labels are shown in 
Table 1. These directly translate into the initial weights of Layers 2 and 4 
in the Action Selection Network. 

Table 6.1: The 13 rules used with 9 labels for force. 

Rule   $\theta$   $\dot{\theta}$   $x$    $\dot{x}$   Force
 1     PO1    PO2    null   null   PL
 2     PO1    ZE2    null   null   PM
 3     PO1    NE2    null   null   ZE
 4     ZE1    PO2    null   null   PS
 5     ZE1    ZE2    null   null   ZE
 6     ZE1    NE2    null   null   NS
 7     NE1    PO2    null   null   ZE
 8     NE1    ZE2    null   null   NM
 9     NE1    NE2    null   null   NL
10     VS1    VS2    PO3    PO4    PS
11     VS1    VS2    PO3    PS4    PVS
12     VS1    VS2    NE3    NE4    NS
13     VS1    VS2    NE3    NS4    NVS

The design of the rule base for this fuzzy controller follows the algorithm 
developed in [9, 11] which is based on a hierarchical process which considers 
the interaction of multiple goals. 

As mentioned earlier, the rule base of a fuzzy controller consists of rules 
which are described using linguistic variables. As shown in Figure 8(a) and 
Figure 8(b), four labels are used here to linguistically define the value of the 
state variables: Positive (PO), Very Small (VS), Zero (ZE), and Negative 
(NE). Nine labels are used to linguistically define the force value 
recommended by each control rule: Positive Large (PL), Positive Medium (PM), 
Positive Small (PS), Positive Very Small (PVS), Zero (ZE), Negative Very 
Small (NVS), Negative Small (NS), Negative Medium (NM), and Negative 
Large (NL). The forward calculations in this network are based on fuzzy 
logic control as described earlier. Nine fuzzy control rules were written for 
balancing the pole vertically and four control rules were used in positioning 
the cart at a specific location on the rail tracks [11]. These rules are shown 
in Table 6.1. In Figure 7, the presence of a link between an input unit j and 
a unit i in the hidden layer indicates that the linguistic value of the input 
corresponding to unit j is used as a precondition in rule i. The first nine 
rules, corresponding to the hidden layer units 1 to 9, are rules with two 
preconditions (i.e., $\theta$ and $\dot{\theta}$). The rules 10 through 13 have four preconditions 
representing the linguistic values of $\theta$, $\dot{\theta}$, $x$, and $\dot{x}$. 
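
The layer widths quoted for the ASN follow directly from the rule base. As a sketch, the thirteen rules of Table 6.1 can be written as plain Python data (the variable names and the tuple layout are illustrative choices) and the distinct labels counted:

```python
# Each rule: (antecedent labels, consequent label), in the order of Table 6.1.
RULES = [
    (("PO1", "PO2"), "PL"),
    (("PO1", "ZE2"), "PM"),
    (("PO1", "NE2"), "ZE"),
    (("ZE1", "PO2"), "PS"),
    (("ZE1", "ZE2"), "ZE"),
    (("ZE1", "NE2"), "NS"),
    (("NE1", "PO2"), "ZE"),
    (("NE1", "ZE2"), "NM"),
    (("NE1", "NE2"), "NL"),
    (("VS1", "VS2", "PO3", "PO4"), "PS"),
    (("VS1", "VS2", "PO3", "PS4"), "PVS"),
    (("VS1", "VS2", "NE3", "NE4"), "NS"),
    (("VS1", "VS2", "NE3", "NS4"), "NVS"),
]

N_INPUTS = 4  # theta, theta_dot, x, x_dot

# Layer 2 has one unit per distinct antecedent label,
# layer 3 one per rule, layer 4 one per distinct consequent label.
antecedent_labels = sorted({lab for ante, _ in RULES for lab in ante})
consequent_labels = sorted({con for _, con in RULES})
layer_sizes = (N_INPUTS, len(antecedent_labels), len(RULES),
               len(consequent_labels), 1)
```

Counting this way gives 14 antecedent labels, 13 rules, and 9 consequent labels, in agreement with the widths stated for layers 2, 3, and 4.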

For any particular control problem using the GARIC architecture, the 
fuzzy rules and their initial shapes and definitions need to be set up. We 
have used triangular membership functions for all antecedent and consequent 
labels. This choice is general enough to be applicable to many other 
problems besides cart-pole balancing. There are 13 rules for this 4-input 
system, and they use 23 linguistic labels in all. The spreads of a fuzzy 
membership function lie in the range $(0, \infty)$. If a spread is $\infty$, this parameter will 
not be changed during learning, and the defuzzification procedure (LMOM) 
will work by inverting the non-constant portion. In addition, the softmin 
parameter k is set at a value of 10, and the learning rate $\eta$ is 0.01. 

The labels and rule descriptions are presented in Figure 8. Given the 
rule base, the parameters may be thought of as a means of controlling the 
meaning of the linguistic terms. When the parameters change, this meaning 
is being tuned to be consistent with the rules, such that good performance 
results. In fact, performance is the only objective criterion of “correctness” 
of the label definitions, in the context of the fixed rulebase. 

6.2 Results 

A trial in our experiments refers to starting with the cart-pole system set to 
an initial state and ending with the appearance of a failure signal or 
successful control of the system for an extended period¹. The default parameters 
for the simulations are: half-pole length 0.5 m; pole mass 0.1 kg; cart mass 
1.0 kg; learning rate in the consequents 0.001. The starting configuration 
after each failure was varied in numerous ways including randomly. The rules 
and starting label descriptions were varied by large amounts. The damages 
to the labels, which are the variations from the labels’ original definitions, as 
well as changes in the parameters, are described with each figure. A starting 
position of 0.1, for example, implies that all 4 state variables were set to 
0.1 after each failure. A randomized start means that after each failure, the 
initial configuration (all 4 parameters) was independently and randomly 
chosen. In the graphs, each curve shows the value of a state variable and 
is in four pieces. The first and second pieces show this value for the first 
few time steps of the first and second trials respectively. If the trial lasted 
less than 300 time steps, then the entire trial is shown, but if not, only the 
first 300 time steps are shown. The third and fourth pieces of the curve 
show the first 300 and last 300 (from 99700 to 100000) time steps of the last 
(successful) trial, when the experiment was terminated. Of course, failure 
occurs whenever $\theta$ or $x$ exceeds its respective bound. 

¹ We say that the system has learned to control the cart-pole if no failure is observed 
before 100 000 time steps. This time corresponds to about 33 minutes of real time. 
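
The definition of a trial can be sketched as a simple loop. The callables `step`, `policy`, and `failed` are hypothetical placeholders for the plant simulation, the controller, and the failure test; only the 100 000-step success criterion and the 20 msec time step come from the text.

```python
def run_trial(step, policy, failed, state, max_steps=100_000):
    """Run one trial; return (steps survived, success flag).

    A trial ends at the first failure, or counts as a success once
    max_steps time steps pass without one.
    """
    for t in range(max_steps):
        state = step(state, policy(state))
        if failed(state):
            return t + 1, False
    return max_steps, True


def real_time_minutes(steps, dt=0.02):
    """Simulated time in minutes covered by a number of time steps."""
    return steps * dt / 60.0
```

With these figures, 100 000 steps of 20 msec amount to 2000 seconds, which is the roughly 33 minutes of real time mentioned in the footnote.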

Figure 10 shows the performance of the controller during the learning 
process. This is to clearly demonstrate how the membership functions are 
shifted to the correct place by learning. In this experiment, we shifted the 
center of the membership function for ZE by 5 N (this is shown in the figure’s 
caption by ZE +5). The system learned to shift it back to about 0 as shown 
in Figure 9. This change is sufficient for success, given the robustness of 
the fuzzy inference process. Other labels were also shifted by about 1 N, 
which is a minimal change. The start state was non-random. Modifications 
to all force labels are shown in Table 2. Figures 11, 12, 13, 14, 15, and 16 
illustrate the performance of the learning system under different scenarios 
which are described in the figure captions. 


6.2.1 Additional Experiments 

Two additional sets of experiments were performed. In the first set, we 
varied the number of labels for force from 9 to 7 and redefined their membership 
functions as shown in Table 3. Figures 17, 18, 19, 20, 21, 22, and 23 show 
the results of further experiments using the new membership functions with 
the rules which are shown in Table 4. 

Further experiments were performed using 9 modified labels for force as 
shown in Table 5. The table following Table 5 summarizes the results of 
these runs. 

Table 2: Force labels after learning 

Label   Center   Left spread   Right spread
PL       19.89      5.10         -1.00
PM        8.25      5.84          5.16
PS        6.73      3.99          5.80
PVS       1.08      0.99          1.01
NL      -20.29     -1.00          4.71
NM       -9.72      5.69          5.30
NS       -7.28      6.14          2.85
NVS      -0.09      0.31          1.68
ZE       -0.18      1.86          0.14

Table 3: The membership functions for the 7 force labels in the consequent 

Label   Center        Left spread   Right spread
PL      [illegible]   [illegible]   [illegible]
PS      [illegible]   [illegible]   [illegible]
PVS       1.0           1.0           1.0
NL      -20.0          -1.0           1.0
NS       -5.0           1.0           1.0
NVS     [illegible]     1.0           1.0
ZE        0.0           1.0           1.0

Table 4: The 13 rules used for set 1 with 7 labels for force. 

Rule   $\theta$   $\dot{\theta}$   $x$    $\dot{x}$   Force
 1     PO1    PO2    null   null   PL
 2     PO1    ZE2    null   null   PL
 3     PO1    NE2    null   null   ZE
 4     ZE1    PO2    null   null   PS
 5     ZE1    ZE2    null   null   ZE
 6     ZE1    NE2    null   null   NS
 7     NE1    PO2    null   null   ZE
 8     NE1    ZE2    null   null   NL
 9     NE1    NE2    null   null   NL
10     VS1    VS2    PO3    PO4    PS
11     VS1    VS2    PO3    PS4    PVS
12     VS1    VS2    NE3    NE4    NS
13     VS1    VS2    NE3    NS4    NVS

Table 5: Revised membership functions for force 

Label   Center   Left spread   Right spread
PL        150      100           -1
PM         90      120            0
PS          0        0           80
PVS         0        0           20
NL       -150       -1          100
NM        -90        0          120
NS          0       80            0
NVS         0       20            0
ZE          0        0.2          0.2

Experiment Description                       No. of Trials to learn
ZE +5., start from 0., learn rate = 0.01     34
Same as above, learning rate = 0.1           89
Same as above, learning rate = 0.001         32
Same as above, learning rate = 0.0001        85
ZE(force) +5, +5, +5                         33
ZE1 +0.2                                      0

7 Discussion 

GARIC’s architecture is similar to the structure proposed by Anderson [2], 
but the action selection network in our architecture is a synthesis of fuzzy 
logic control and neural networks. Using the structure of a fuzzy controller, 
Anderson’s approach is extended to provide for continuous representation of 
the output value and inclusion of the human expert operator’s control rules 
in the action selection network. It should be noted that Anderson’s goal 
in [1] was to discover interesting patterns and strategy-learning schemes. 
Not much effort was spent on making the process learn faster. In our work, 
although we allow some of the strategy learning to occur automatically, we 
start from a knowledge base of fuzzy control rules and tune them by learning 
in the neural networks. 

Also, the stochastic action modifier unit in GARIC has similarities to 
Gullapalli’s method [13] although we use a completely different approach for 
defining the internal reinforcement. Lee and Berenji [17] and Lee [16] have 
used a single layer neural network which requires the identification of the 
trace functions for keeping track of the visited states and their evaluations. 
The generation of these trace functions is a difficult task in larger control 
problems. However, the approach suggested in GARIC does not use trace 
functions. The neural network representation of the fuzzy control rules in 
GARIC allows faster development and faster learning. Also, in the single 
layer model, only the generation of the output values was considered. The 
preconditions of the fuzzy control rules were left untouched. However, in 
GARIC, based on reinforcements received from the environment, both the 
preconditions and the conclusions of rules can be modified. 

The ARIC and GARIC architectures both use external reinforcements to 
form internal evaluations of states and control actions. Also, they both use 
internal reinforcements to guide the process of tuning the rules. However, 
GARIC extends the theory for using reinforcement learning in fuzzy control 
in many respects including: 

• Learning is achieved by full integration of fuzzy inference into a feed- 
forward network, which can then adaptively improve performance by 
using gradient descent methods. 

• The fuzzy memberships used in the definition of the labels are modified 
(tuned) globally in all the rules rather than being locally modified in 
each individual rule. 

• GARIC can compensate for inappropriate definitions of fuzzy mem- 
bership functions in the antecedent of control rules. We showed this 
attribute by damaging the labels used in the antecedents and observ- 
ing how the system can learn a new control policy to succeed. To the 
best of our knowledge, GARIC is the first architecture to do this. 

• GARIC introduces a new conjunction operator in computing the rule 
strengths of fuzzy control rules. 

• GARIC introduces a new localized mean of maximum (LMOM) method 
in combining the conclusions of several firing control rules. 

• Only monotonic membership functions are used in ARIC. However, 
GARIC allows any type of differentiable membership functions to be 
used in the construction of a fuzzy logic controller. 

8 Conclusions 

With the GARIC architecture, we have proposed a new way of designing 
and tuning a fuzzy logic controller. The knowledge used by an experienced 
operator in controlling a process can now be modeled using approximate 
linguistic terms and later refined through the process of learning from 
experience. GARIC provides a well-balanced method for combining the 
qualitative knowledge of human experts in terms of symbolic rules and the learning 
strength of artificial neural networks. Therefore, we believe that this 
architecture is general enough for use in other rule-based systems which 
perform fuzzy logic inference. 

Figure 1: A simple architecture of a fuzzy logic controller 

Figure 2: Fuzzification of antecedents using softmin. 

Figure 3: The Architecture of GARIC. 

Figure 4: The Action Evaluation Network 

Figure 5: The Action Selection Network 

Figure 6: Local Mean-of-Maximum uses the centroid of the line segment at height $w_r$ intercepted 
by the fuzzy membership function, if it is non-monotonic 

Figure 7: GARIC applied to cart pole balancing. 

Figure 8: (a) Four qualitative labels for $\theta$, $\dot{\theta}$, $x$, and $\dot{x}$; (b) Nine qualitative labels for F. 

Figure 9: Learning to correct an inappropriate definition of a label’s membership function. 

Figure 10: The ZE force label was set to +5. The system shifted it back to about 0, which is enough 
for success, given the robustness of the fuzzy inference process. Other labels were also shifted by 
about 1 N, which is a minimal change. Start state was non-random. The system learned in 322 trials. 

[Figure panel: (a) Pole Position (deg)] 
Figure 11: Here the starts are randomized. Other parameters are the same as above. More trials are 
required to learn, but not many more. Again, the system brings back the label from 5 to near 0. The 
system learned in 367 trials. 

[Figure panels: (a) Pole Position (deg); (b) Cart Position (m)] 

Figure 12: Antecedents changed: ZE1 +0.2, ZE2 -0.4, PO3 -0.1, NS4 -0.1. Consequents: [illegible]. 
Start: randomized. The consequents are all adjusted by small amounts. ZE is brought back again 
because it is the most critical label in some sense. If this label is not correct, balancing is impossible 
even if the system has started from a good state. Here, the system learned in 312 trials. 

Figure 13: More damage in addition to above. Antecedents changed: ZE1 -0.2, ZE2 -0.4, PO3 
-0.1, NE3 (-1.0, 0, +1), PO4 (+0.3, +0.3, 0), PS4 +0.3, NS4 -0.1. Consequents changed: PL +5, PM +3, 
PS +2, NM +1, ZE -2. Start: randomized. Again, ZE is corrected and the others are shifted by 
comparatively smaller amounts. PM is brought back near to its original value. In general, there is 
less pressure to correct the less-frequently used labels (such as PL), since once the system is in a 
good area of state-space, it can completely avoid using PL after it has made the important repairs. 
Learning took 550 trials. 

[Figure panels: (a) Pole Position (deg); (b) Cart Position (m)] 

Figure 14: Parameter damage: maximum pole position = 0.1 rad and maximum cart position 
= 0.2 m, which have been reduced from 0.2 rad and 2.4 m respectively. Start was randomized. 
Learning took 82 trials. 

Figure 15: Pole half-length was reduced to 0.2 m from 0.5 m, and the start was randomized. Only 
4 trials required for learning. 

[Figure panels: (a) Pole Position (deg); (b) Cart Position (m)] 

Figure 16: Cart mass was doubled to 2.0 kg. Random starting state. 2 trials were enough for 
success. The PL and NL shift their centers away by small amounts to compensate for the heavier 
cart. Similar effects are observed in other labels. The shifts are all small, since the effect is 
distributed over a large number of parameters. 

Figure 17: Damages of: PO4 (+0.4, +0.4, 0), NE4 (+0.2, 0, -0.2), PS4 (+0.5, 0, 0), NS4 (-0.5, 0, 0), 
starting state = 0.1. Learned in 2 trials. 

Figure 18: Cart mass = 4.3 kg (increased from 1 kg), starting position = 0.1, learning occurred 
in 3 trials. 

Figure 19: Damages to all labels of $\theta$: (+0.2, +0.2, 0), (+0.2, 0, 0), (-0.2, 0, +0.2), (-0.1, 0, 0) to 
PO1, ZE1, NE1, VS1 respectively. Starting position = 0.05. It took 29 trials to succeed. 

[Figure panels: (a) Pole Position (deg); (b) Cart Position (m)] 

Figure 21: Damage to consequents = (-5, -1, 0, 5.5, 1, 0, 0). Starting position = -0.19 (which is very 
close to failure). 136 trials were required. 

Figure 22: Major damage to consequents: PL +40.0, NL +5. Randomized starting positions. 15 
trials were sufficient. 

[Figure panels: (a) Pole Position (deg); (b) Cart Position (m)] 

Figure 23: Max cart position = 0.5 (from 2.4). Randomized starts. Antecedent damage: ZE1 
+0.3, ZE2 0.6, VS2 +0.2, PO3 -0.4, NE4 +0.5, PS4 +0.1, NS4 +0.3. Learned in 140 trials. 